1 Introduction

This notebook presents the first empirical study of the paper Forking paths in empirical studies, available on SSRN.
The idea of the paper is to decompose any empirical analysis into a series of steps, which are (somewhat pompously) called mappings in the paper.
Each step matters and can have a substantial impact on the outcome of the study.
Because of that, we argue that the researcher should, whenever possible, keep track of all possible modelling options and present the distribution of outcomes (e.g., t-statistics or p-values) across all configurations.

Below, some code chunks are shown (when they are potentially insightful), while others are hidden (especially those producing plots).
All code content can be accessed by clicking on the related buttons.

2 Data

First, we load the libraries & import the data.
The data, downloaded from Amit Goyal’s website, is stored on a GitHub repo.
It was used in the follow-up paper A Comprehensive Look at the Empirical Performance of Equity Premium Prediction II.
Because we do not want to download the file multiple times, we fetch it only once and read the three sheets (monthly, quarterly & yearly data).

library(tidyverse)   # Data wrangling & plotting
library(readxl)      # To read excel files
library(zoo)         # For data imputation
library(DescTools)   # For winsorization
library(sandwich)    # HAC estimator
library(lmtest)      # Statistical inference
library(furrr)       # Parallel computing
library(viridis)     # Color palette
library(patchwork)   # Graph layout
library(xtable)      # LaTeX exports
library(reshape2)    # List management
library(stabledist)  # For stable distributions
library(ptsuite)     # For tail estimation

loadWorkbook_url <- function(sheet) { # Downloads the data (once) and reads a given sheet
    url <- "https://github.com/shokru/coqueret.github.io/blob/master/files/misc/PredictorData2021.xlsx?raw=true"
    temp_file <- file.path(tempdir(), "PredictorData2021.xlsx")
    if(!file.exists(temp_file)){      # Download only on the first call
        download.file(url = url, destfile = temp_file, mode = "wb", quiet = TRUE)
    }
    read_excel(temp_file, sheet = sheet)
}

data_month <- loadWorkbook_url(1)    # Dataframe for monthly data
data_quarter <- loadWorkbook_url(2)  # Dataframe for quarterly data
data_year <- loadWorkbook_url(3)     # Dataframe for annual data

3 Mappings

Next, we code the modules. They correspond to the \(f_j\) mappings in the paper.
The chaining of mappings follows the scheme below:

Because of the strong analogy with neural networks, we code the mappings/modules in the spirit of the early sequential API of the Keras framework.

3.1 Sheet selection

First module: decide which sheet to work with (i.e., the data frequency: monthly, quarterly or yearly).

prob_sheet <- c(0,1,0)
module_sheet <- function(prob_sheet){
    prob_month <- prob_sheet[1]                                          # Prob. to pick sheet 1 (monthly)
    prob_quarter <- prob_sheet[2]                                        # Prob. to pick sheet 2 (quarterly)
    prob_year <- prob_sheet[3]                                           # Prob. to pick sheet 3 (yearly)
    if((prob_month == 0) + (prob_quarter == 0) + (prob_year == 0) == 2){ # Deterministic case: one prob. equals 1
        if(prob_month == 1){return(data_month)}
        if(prob_quarter == 1){return(data_quarter)}
        if(prob_year == 1){return(data_year)}
    } else {                                                             # Random case (not used below)
        p <- runif(1)
        if(p <= prob_month){return(data_month)}
        else if(p <= prob_month + prob_quarter){return(data_quarter)}
        else {return(data_year)}
    }
}
module_sheet(prob_sheet)[250:260, 1:11] |>                               # A snapshot
    mutate(across(everything(), as.numeric)) |> 
    knitr::kable(font_size = 5)
yyyyq Index D12 E12 b/m tbl AAA BAA lty cay ntis
19332 10.91 0.4700 0.4250 0.8335032 0.0007 0.0446 0.0707 0.0306 NaN 0.0007609
19333 9.83 0.4550 0.4325 0.8679966 0.0004 0.0436 0.0727 0.0308 NaN -0.0004585
19334 10.10 0.4400 0.4400 0.8290260 0.0029 0.0450 0.0775 0.0336 NaN 0.0061958
19341 10.75 0.4425 0.4525 0.8025122 0.0024 0.0413 0.0626 0.0307 NaN 0.0096063
19342 9.81 0.4450 0.4650 0.8407311 0.0015 0.0393 0.0606 0.0289 NaN 0.0059828
19343 9.10 0.4475 0.4775 0.8703644 0.0021 0.0396 0.0657 0.0310 NaN 0.0194390
19344 9.50 0.4500 0.4900 0.7737409 0.0023 0.0381 0.0623 0.0293 NaN 0.0208127
19351 8.47 0.4500 0.7300 0.8007541 0.0015 0.0367 0.0620 0.0274 NaN 0.0120460
19352 10.23 0.4400 0.8100 0.6818182 0.0015 0.0361 0.0577 0.0270 NaN 0.0193440
19353 11.59 0.4400 0.7600 0.6117344 0.0020 0.0359 0.0553 0.0282 NaN 0.0080266
19354 13.43 0.4700 0.7600 0.5599112 0.0015 0.0344 0.0530 0.0276 NaN 0.0086888

After the first module, each mapping takes data as first argument. This will make it easy to use pipes (sequences of instructions).

3.2 Handling missing data

Second module: data cleaning.
NOTE: all columns must be converted to numbers beforehand.

module_cleaning <- function(data, prob_remove){                          # Two inputs: data & choice
    data <- data |> mutate(across(everything(), as.numeric))             # Convert everything to numbers
    if(runif(1) <= prob_remove){                                         # If remove:
        return(data |> na.omit())                                        # REMOVE!
    } else {                                                             # Impute from previous non-NA value
        return(data |> mutate(across(everything(), ~ na.locf(.x, na.rm = FALSE)))) # (not very useful here)
    }
}
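As an aside, the carry-forward imputation can be replicated without zoo. The helper below (locf, our own illustration, not part of the study) mimics na.locf(x, na.rm = FALSE) on a numeric vector:

```r
# Base-R sketch of last-observation-carried-forward (LOCF) imputation,
# mimicking zoo::na.locf(x, na.rm = FALSE): leading NAs are kept as-is
locf <- function(x) {
  idx <- seq_along(x)
  idx[is.na(x)] <- NA                          # Keep indices of observed values only
  idx <- cummax(replace(idx, is.na(idx), 0))   # Index of the last observed value so far
  out <- x
  out[idx > 0] <- x[idx[idx > 0]]              # Carry that value forward
  out
}

locf(c(NA, 1, NA, NA, 4, NA))                  # NA 1 1 1 4 4
```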

3.3 Winsorizing

Third module: winsorizing.
We first construct some additional predictors in this module (from the raw series): payout, dfy and dfr.
The winsorization comes afterwards; its only input is the winsorization level.

module_winsorization <- function(data, threshold){
    data <- data |> mutate(payout = log(D12/E12),  # Log payout ratio
                           dfy = BAA - AAA,        # Default yield spread
                           dfr = corpr - ltr)      # Default return spread
    data <- data |> mutate(across(2:ncol(data), ~ Winsorize(.x, probs = c(threshold, 1 - threshold), na.rm = TRUE)))
    return(data)
}
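For intuition, winsorization simply clips values at sample quantiles. A minimal base-R equivalent of DescTools::Winsorize (our own sketch, winsorize_q, using R's default quantile definition) is:

```r
# Base-R sketch of symmetric winsorization at level p: values below the
# p-quantile (above the (1-p)-quantile) are replaced by that quantile
winsorize_q <- function(x, p) {
  q <- quantile(x, probs = c(p, 1 - p), na.rm = TRUE)
  pmin(pmax(x, q[1]), q[2])     # Clip on both sides
}

x <- c(1:9, 100)
winsorize_q(x, 0.1)             # The outlier 100 is pulled back towards the bulk
```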

3.4 Levels vs. differences

Fourth module: levels vs. differences.
This step decides if independent variables are left as levels, or if variations are used instead.

differentiate <- function(v){v - lag(v)}

module_levels <- function(data, prob_levels){
    if(runif(1) < prob_levels){return(data)}
    else return(data |> mutate(across(3:ncol(data), differentiate)) |> na.omit())
}
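The differentiate function relies on dplyr::lag; a base-R check of what it computes (differentiate_base is our own illustrative helper):

```r
# First differences in base R (equivalent to v - dplyr::lag(v)):
# shift the vector by one position and subtract
differentiate_base <- function(v) v - c(NA, v[-length(v)])

differentiate_base(c(1, 3, 6, 10))   # NA 2 3 4
```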

3.5 Choice of predictor

Fifth module: independent variable.
This stage sets the (unique) predictor. There are 6 in the study.

module_independent <- function(data, variable){
    data |> dplyr::select(all_of(c("Index", variable, "Rfree")))
}

3.6 Horizon for returns

Sixth module: horizon.
This module determines the horizon of the dependent variable (return).
It is given in periods; hence, results are contingent on the initial choice of sheet (first module).

module_horizon <- function(data, horizon){
    data[,1] <- lead(data[,1], horizon)/data[,1] - 1 - horizon * data[,3]  # Excess returns on the Index (S&P 500)
    data
}
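To see what the module computes, here is a toy base-R version of the h-period excess return (h_excess is our own helper for illustration; the risk-free leg is the same linear approximation, h times the per-period rate, as above):

```r
# h-period excess return: lead(P, h) / P - 1 - h * rf
h_excess <- function(price, rf, h) {
  n <- length(price)
  lead_p <- c(price[(h + 1):n], rep(NA, h))  # Price shifted h periods ahead
  lead_p / price - 1 - h * rf                # Simple return minus approx. risk-free
}

h_excess(c(100, 110, 121), rf = 0.01, h = 1)   # 0.09 0.09 NA
```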

3.7 Subsampling: starting point

Seventh module: starting point.
In our study, we allow for simple subsampling.
The idea is that predictability should not be (too much) time-dependent (though this may be the case).
Thus, we allow for 3 sample size options:
- full sample
- sample that begins in the middle of the available data (second half)
- sample that stops at the middle of the data (first half)

This module picks the starting point (either the original first point, or the middle point).

module_start <- function(data, start_quantile){
    L <- nrow(data)
    data[ceiling(L*start_quantile):L, ]
}

3.8 Subsampling: stopping point

Eighth module: stopping point.

This module picks the stopping point (either the original last point, or the middle point).

module_stop <- function(data, stop_quantile){
    L <- nrow(data)
    data[1:ceiling(L*stop_quantile), ]
}
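A toy example clarifies how the two subsampling modules slice the data (note that the middle observation belongs to both halves):

```r
# Subsampling a 10-row data frame, as in module_start / module_stop
d <- data.frame(x = 1:10)
L <- nrow(d)
first_half  <- d[1:ceiling(L * 0.5), , drop = FALSE]   # stop_quantile = 0.5
second_half <- d[ceiling(L * 0.5):L, , drop = FALSE]   # start_quantile = 0.5
c(nrow(first_half), nrow(second_half))                 # 5 6 (row 5 is shared)
```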

3.9 Estimator type

Ninth module: estimator type.
This last step takes the data and performs estimations.
Four options are possible: the standard OLS (iid) estimator, the HAC estimator from Newey-West, and augmented versions of both (AUG & AUG-HAC), which add the innovations of an AR(1) fit of the predictor as a control variable.
The final function yields 10 outputs:
- the coefficient, the standard error, the t-statistic, the p-value, the skewness of errors, which are baseline outputs;
- the two criteria (AIC & BIC), for frequentist model averaging;
- the sample size, the sum of squared residuals and the variance of the dependent variable, for the Bayesian model averaging.

module_estimator <- function(data, estimator){
    data <- data |> na.omit()                                        # Removes NAs
    y <- data |> pull(1)                                             # Dependent variable (excess return)
    x <- data |> pull(2)                                             # Independent variable (predictor)
    x <- x / sd(x)                                                   # Scaling the predictor
    
    if(nrow(data) < 31){                                              # Test for sample size
        t_stat <- NA                                                    # NA if too small (not enough obs.)
        sd <- NA
        coef <- NA
        p_val <- NA
        aic <- NA
        bic <- NA
        sum_sq_err <- NA
        var_y <- NA
        skew <- NA
    } else {
        if(estimator == "HAC"){
            model <- lm(y ~ x)                                        # Running the regression
            coef <- summary(model)$coef[2,1]                          # OLS coefficient
            sd <- coeftest(model, vcov. = NeweyWest(model))[2,2]      # NW standard error
            t_stat <- coeftest(model, vcov. = NeweyWest(model))[2,3]  # t-stat with Newey-West cov. matrix
            p_val <- coeftest(model, vcov. = NeweyWest(model))[2,4]   # p-value
            aic <- AIC(model)                                         # Akaike Info Criterion
            bic <- BIC(model)                                         # Bayesian Info Criterion
            sum_sq_err <- sum(model$residuals^2)                      # Sum of squared residuals
            var_y <- var(y)                                           # Variance of dependent variable
            skew <- mean((model$residuals/sd(model$residuals))^3)     # Skewness of residuals
        } 
        if(estimator == "OLS"){
            model <- lm(y ~ x)                                        # Running the regression
            coef <- summary(model)$coef[2,1]                          # OLS coef
            sd <- summary(model)$coef[2,2]                            # OLS standard error
            t_stat <- summary(model)$coef[2,3]                        # OLS t-statistic
            p_val <- summary(model)$coef[2,4]                         # p-value
            aic <- AIC(model)                                         # Akaike Info Criterion
            bic <- BIC(model)                                         # Bayesian Info Criterion
            sum_sq_err <- sum(model$residuals^2)                      # Sum of squared residuals
            var_y <- var(y)                                           # Variance of dependent variable
            skew <- mean((model$residuals/sd(model$residuals))^3)     # Skewness of residuals
        }
        if(estimator == "AUG"){
            model_x <- lm(x ~ lag(x))                                 # x-regression
            rho <- model_x$coef[2]
            nu <- lead(x) - rho * x                                   # x-innovations            
            model <- lm(y ~ x + nu)
            coef <- summary(model)$coef[2,1]                          # OLS coef
            sd <- summary(model)$coef[2,2]                            # OLS standard error
            t_stat <- summary(model)$coef[2,3]                        # OLS t-statistic
            p_val <- summary(model)$coef[2,4]                         # p-value
            aic <- AIC(model)                                         # Akaike Info Criterion
            bic <- BIC(model)                                         # Bayesian Info Criterion
            sum_sq_err <- sum(model$residuals^2)                      # Sum of squared residuals
            var_y <- var(y)                                           # Variance of dependent variable
            skew <- mean((model$residuals/sd(model$residuals))^3)     # Skewness of residuals
        }    
        if(estimator == "AUG-HAC"){
            model_x <- lm(x ~ lag(x))                                 # x-regression
            rho <- model_x$coef[2]
            nu <- lead(x) - rho * x                                   # x-innovations            
            model <- lm(y ~ x + nu)
            coef <- summary(model)$coef[2,1]                          # OLS coef
            sd <- coeftest(model, vcov. = NeweyWest(model))[2,2]      # NW standard error
            t_stat <- coeftest(model, vcov. = NeweyWest(model))[2,3]  # t-stat with Newey-West cov. matrix
            p_val <- coeftest(model, vcov. = NeweyWest(model))[2,4]   # p-value
            aic <- AIC(model)                                         # Akaike Info Criterion
            bic <- BIC(model)                                         # Bayesian Info Criterion
            sum_sq_err <- sum(model$residuals^2)                      # Sum of squared residuals
            var_y <- var(y)                                           # Variance of dependent variable
            skew <- mean((model$residuals/sd(model$residuals))^3)     # Skewness of residuals
        }    
    }
    return(list(coef = coef, sd = sd, t_stat = t_stat, p_val = p_val, 
                sample_size = nrow(data), aic = aic, bic = bic,
                sum_sq_err = sum_sq_err, var_y = var_y, skew = skew))
}
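For readers curious about what the HAC option does, here is a minimal sketch of the Bartlett-kernel (Newey-West) long-run variance of a univariate series; sandwich::NeweyWest implements the same idea for regression coefficients, with refinements such as automatic bandwidth selection and prewhitening:

```r
# Newey-West long-run variance with Bartlett weights and L lags
nw_lrv <- function(u, L) {
  n <- length(u)
  u <- u - mean(u)
  s <- sum(u * u) / n                        # Lag-0 autocovariance
  for (l in seq_len(L)) {                    # Add down-weighted autocovariances
    g_l <- sum(u[(l + 1):n] * u[1:(n - l)]) / n
    s <- s + 2 * (1 - l / (L + 1)) * g_l     # Bartlett weight: 1 - l/(L+1)
  }
  s
}

nw_lrv(c(1, -1, 1, -1), L = 1)   # 0.25: negative autocorrelation shrinks the variance
```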

4 Aggregation & spanning

4.1 Module chaining

We show a simple test below.
It illustrates the natural chaining of the mappings, akin to a forward pass in a neural network (without back-propagation!).
All modules are executed successively.

prob_sheet <- c(1,0,0)   # Monthly data
prob_remove <- 1         # Remove NAs
threshold <- 0.01        # Threshold for winsorization
prob_levels <- 0         # Switch to differences
variable <- "D12"        # Which independent variable 
horizon <- 3             # Return horizon
start_quantile <- 0.5    # Starting point
stop_quantile <- 1       # Stopping point
estimator <- "AUG"       # Which estimator

module_sheet(prob_sheet = prob_sheet) |>              # Step 1: select data frequency
    module_cleaning(prob_remove = prob_remove) |>     # Step 2: clean the data
    module_winsorization(threshold = threshold) |>    # Step 3: manage outliers
    module_levels(prob_levels = prob_levels) |>       # Step 4: decide between levels versus differences
    module_independent(variable = variable) |>        # Step 5: choose predictor
    module_horizon(horizon = horizon) |>              # Step 6: pick return horizon (dependent variable)
    module_start(start_quantile = start_quantile) |>  # Step 7: starting point (for subsampling)
    module_stop(stop_quantile = stop_quantile) |>     # Step 8: ending point (for subsampling)
    module_estimator(estimator = estimator)            # Step 9: select estimator type
## $coef
## [1] 0.006294986
## 
## $sd
## [1] 0.003970508
## 
## $t_stat
## [1] 1.585436
## 
## $p_val
## [1] 0.1136841
## 
## $sample_size
## [1] 391
## 
## $aic
## [1] -882.1459
## 
## $bic
## [1] -866.2813
## 
## $sum_sq_err
## [1] 2.329944
## 
## $var_y
## [1] 0.006023265
## 
## $skew
## [1] -0.4276608

Below, we code a function that takes all parameters and runs the whole chain of operators/mappings.
Because we consider a uniform distribution over the sheets (horizons), we recode the function slightly so that the corresponding input is a simple string.

module_aggregator <- function(sheet = sheet,
                              prob_remove = prob_remove,
                              threshold = threshold,
                              prob_levels = prob_levels,
                              variable = variable,
                              horizon = horizon,
                              start_quantile = start_quantile,
                              stop_quantile = stop_quantile,
                              estimator = estimator){
    if(sheet == "Monthly"){prob_sheet <- c(1,0,0)}
    if(sheet == "Quarterly"){prob_sheet <- c(0,1,0)}
    if(sheet == "Yearly"){prob_sheet <- c(0,0,1)}
    module_sheet(prob_sheet = prob_sheet) |>
        module_cleaning(prob_remove = prob_remove) |>
        module_winsorization(threshold = threshold) |>
        module_levels(prob_levels = prob_levels) |>
        module_independent(variable = variable) |>
        module_horizon(horizon = horizon) |>
        module_start(start_quantile = start_quantile) |>
        module_stop(stop_quantile = stop_quantile) |>
        module_estimator(estimator = estimator)
}

4.2 Spanning all design options

Next, we span all the combinations we want to test. Implicitly, they are given the same (uniform) probability.

sheet <- c("Monthly", "Quarterly", "Yearly")
prob_remove <- c(0, 1)
threshold <- c(0, 0.01, 0.02, 0.03)
prob_levels <- c(0, 1)
variable <- c("payout", "b/m", "svar", "dfr", "dfy", "ntis")
horizon <- c(1, 3, 6, 12)
start_quantile <- c(0, 0.5)
stop_quantile <- c(0.5, 1)
estimator <- c("OLS", "HAC", "AUG", "AUG-HAC")

pars <- expand_grid(sheet, prob_remove, threshold, prob_levels, variable, horizon, start_quantile, stop_quantile, estimator)
pars <- pars |> filter(!(start_quantile == 0.5 & stop_quantile == 0.5))
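A quick sanity check on the size of the grid (our own back-of-the-envelope computation): the raw product of option counts, minus the start = stop = 0.5 cases filtered out above, should match the number of configurations announced below.

```r
# Number of options per mapping, as listed above
n_levels <- c(sheet = 3, prob_remove = 2, threshold = 4, prob_levels = 2,
              variable = 6, horizon = 4, start_quantile = 2, stop_quantile = 2,
              estimator = 4)
n_total <- prod(n_levels)                # 18432 raw combinations
n_kept  <- n_total - n_total / (2 * 2)   # One of the 4 start/stop pairs is removed
n_kept                                   # 13824
```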


We then launch the computation of the outputs across all 13,824 combinations.
Note that, with the functional programming abilities of R, this requires remarkably little code.
To speed things up, we resort to parallelization over 4 cores.

plan(multisession, workers = 4)                     # Set the number of cores
output <- future_pmap_dfr(pars,  module_aggregator) # Launch!
results <- bind_cols(pars, output)                  # Bind parameters to results
save(results, file = "results.RData")

5 Post treatment

Here, we adjust the distribution of t-statistics to account for the possible error-in-variable (EIV) bias, as in Proposition 2 of Skill, Scale, and Value Creation in the Mutual Fund Industry.

bias_terms <- function(x, m, s, t){ # We follow the notations of Proposition 2
    # The first 5 values are the sample moments of the bivariate series
    m_m <- mean(m, na.rm = T)
    s_m <- sd(m, na.rm = T)
    m_s <- mean(s, na.rm = T)
    s_s <- sd(s, na.rm = T)
    rho <- cor(m,s, use = "complete.obs")
    # Next, the optimal h, along with bar(m1) and bar(m2)
    h <- (3/4/s_m^5*(rho^4*s_s^4/12-(rho*s_s)^2+1)*exp(m_s+s_s^2*(1-rho^2/2)/2)*(length(m)/mean(t)))^(-1/3)
    m1 <- (x-m_m)/s_m
    m2 <- (x-m_m-rho*s_m*s_s)/s_m
    bs1 <- h^2/2/s_m^2*(m1^2-1)/s_m*dnorm(m1)
    bs2 <- (1/2/mean(t)*exp(m_s+s_s^2/2)/s_m^2*(m2^2-1))/s_m*dnorm(m2)
    return(bs1 + bs2)
}

adjust_bias <- function(xx, m, s, t){
    integrate(bias_terms, 
              -Inf, xx, 
              m = m, s = s, t = t)$value
}

adjust_cdf_v <- Vectorize(adjust_bias, vectorize.args='xx')

# Fine discretization
grid_01 <- seq(min(results$t_stat, na.rm = T)-0.1, max(results$t_stat, na.rm = T)+0.1, length.out = 2*nrow(results))
# Next step is long because of integration
Phi_adj <- adjust_cdf_v(grid_01,
           m = results$t_stat,
           s = results$sd,
           t = results$sample_size) + ecdf(results$t_stat)(grid_01)
Phi_adj <- Phi_adj[Phi_adj < 1] |> sort()
# Create inverse function by swapping x & y
quant <- stepfun(Phi_adj, grid_01[1:(length(Phi_adj)+1)])

res_adj <- results
res_adj$t_stat <- quant(ecdf(results$t_stat)(results$t_stat))
res_adj <- res_adj |>  mutate(adjusted = "Yes")
res_adj$coef <- NA
res_adj$sd <- NA
res_adj$p_val <- 2*(1-pt(abs(res_adj$t_stat), results$sample_size))

results <- results |> mutate(adjusted = "No")
results <- bind_rows(results, res_adj)
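The stepfun-based inversion used above can be illustrated on a toy example: given sorted knots on the probability axis and one more quantile value, swapping the roles of x and y yields a step quantile function.

```r
# Toy inverse CDF: stepfun(x, y) needs one more y-value than knots.
# With the default right = FALSE, the function is right-continuous.
cdf_vals  <- c(0.25, 0.50, 0.75)           # Knots on the probability axis
quantiles <- c(-1, 0, 1, 2)                # Corresponding quantile levels
qf <- stepfun(cdf_vals, quantiles)
qf(c(0.1, 0.6, 0.9))                       # -1 1 2
```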

results # Have a look
## # A tibble: 27,648 × 20
##    sheet   prob_remove threshold prob_…¹ varia…² horizon start…³ stop_…⁴ estim…⁵
##    <chr>         <dbl>     <dbl>   <dbl> <chr>     <dbl>   <dbl>   <dbl> <chr>  
##  1 Monthly           0         0       0 payout        1     0       0.5 OLS    
##  2 Monthly           0         0       0 payout        1     0       0.5 HAC    
##  3 Monthly           0         0       0 payout        1     0       0.5 AUG    
##  4 Monthly           0         0       0 payout        1     0       0.5 AUG-HAC
##  5 Monthly           0         0       0 payout        1     0       1   OLS    
##  6 Monthly           0         0       0 payout        1     0       1   HAC    
##  7 Monthly           0         0       0 payout        1     0       1   AUG    
##  8 Monthly           0         0       0 payout        1     0       1   AUG-HAC
##  9 Monthly           0         0       0 payout        1     0.5     1   OLS    
## 10 Monthly           0         0       0 payout        1     0.5     1   HAC    
## # … with 27,638 more rows, 11 more variables: coef <dbl>, sd <dbl>,
## #   t_stat <dbl>, p_val <dbl>, sample_size <int>, aic <dbl>, bic <dbl>,
## #   sum_sq_err <dbl>, var_y <dbl>, skew <dbl>, adjusted <chr>, and abbreviated
## #   variable names ¹​prob_levels, ²​variable, ³​start_quantile, ⁴​stop_quantile,
## #   ⁵​estimator
## # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names

6 Distribution of statistics

Below, we plot the distribution of t-statistics and p-values.

results |> ggplot(aes(x = t_stat, fill = sheet)) + 
    geom_histogram(position = "dodge") + theme_minimal() +
    scale_fill_viridis(discrete = T, begin = 0.1, end = 0.9, name = "Frequency", option = "plasma") + 
    theme(axis.title.y = element_blank(),
          legend.position = c(0.8,0.8))

p1 <- results |> 
    mutate(type = p_val < 0.1) |>
    ggplot(aes(x = p_val, fill = type)) + geom_histogram(bins = 40) + 
    theme_minimal() + 
    scale_fill_manual(values = c("#ADADAD", "#666666")) +
    theme(axis.title = element_blank(), 
          text = element_text(size = 14),
          legend.position = "none") +
    annotate("rect", xmin = 0.19, xmax = 0.55, ymin = 500, ymax = 580, fill = "#FFFFFF") +
    annotate("text", x = 0.39, y = 560, label = "bold(full~distribution)", parse = TRUE)
p2 <- results |> ggplot(aes(x = p_val)) + geom_histogram(bins = 20, aes(y=..density..), position = "identity") + 
    theme_minimal() + xlim(0,0.1) + ylim(0,17) + 
    theme(axis.title = element_blank(),
          panel.background = element_rect(fill = "white", color = "white"),
          panel.grid.major = element_line(size = 0.5, linetype = 'solid',
                                          colour = "#CCCCCC"),
          plot.background = element_rect(fill = "white", color = "white")) +
    annotate("rect", xmin = 0.03, xmax = 0.07, ymin = 11, ymax = 15, fill = "#FFFFFF") +
    annotate("text", x = 0.05, y = 13, label = "bold(p-curve)", parse = TRUE) +
    geom_density(color = "#38BDFC", size = 1.3)
p3 <- p1 + inset_element(p2, left = 0.45, bottom = 0.55, right = 1, top = 1)
p4 <- results |> ggplot(aes(x = p_val, color = variable)) + stat_ecdf(geom = "step") + 
    theme_minimal() + 
    theme(legend.position = c(0.8,0.3),
          text = element_text(size = 14),
          legend.background = element_rect(fill = "white", color = "white"),
          legend.title = element_text(face = "bold"),
          axis.title = element_blank())
p3 + p4 + plot_layout(widths = c(3, 2))

ggsave("p_val_dist.pdf", width = 8, height = 4)

6.1 Impact of mappings

Effects of dual operators: we show the histograms of t-stats when we fix one choice for binary mappings.

g_remove <- results |>
    ggplot(aes(x = t_stat, fill = as.factor(prob_remove))) + theme_minimal() + 
    geom_histogram(position = "identity", alpha = 0.7) +
    theme(legend.position = "top",
          text = element_text(size = 14),
          axis.title = element_blank(),
          legend.title = element_text(face = "bold", size = 12)) + 
    scale_fill_manual(name = "Data cleaning", labels = c("imputation", "exclusion"), values = c("#66EEFF", "#555555")) + 
    guides(fill = guide_legend(nrow = 2, byrow = TRUE))
g_levels <- results |>
    ggplot(aes(x = t_stat, fill = as.factor(prob_levels))) + theme_minimal() + 
    geom_histogram(position = "identity", alpha = 0.7) +
    theme(legend.position = "top",
          text = element_text(size = 14),
          axis.title = element_blank(),
          legend.title = element_text(face = "bold", size = 12)) + 
    guides(fill = guide_legend(nrow = 2, byrow = TRUE)) +
    scale_fill_manual(name = "Variable engineering", labels = c("differences", "levels"), values = c("#FF66FF", "#555555"))
g_est <- results |>
    ggplot(aes(x = t_stat, fill = as.factor(estimator))) + theme_minimal() + 
    geom_histogram(position = "identity", alpha = 0.5) +
    theme(legend.position = "top",
          text = element_text(size = 14),
          axis.title = element_blank(),
          legend.title = element_text(face = "bold", size = 12)) + 
    scale_fill_manual(name = "Estimator", labels = c("AUG", "OLS", "HAC", "AUG-HAC"), 
                      values = c("#000000", "#222277", "#339933", "#DDDD33")) +
    guides(fill = guide_legend(nrow = 2, byrow = TRUE))
g_remove + g_levels + g_est

ggsave("histos.pdf", width = 9, height = 4)

Next, we plot the boxplots of t-statistics: this gives a broad view of their distribution.
We split the analysis according to variables and return horizon, and also differentiate between data frequencies.

g1 <- results |>
    ggplot(aes(x = variable, y = t_stat, color = sheet)) +
    geom_boxplot() + theme_minimal() +
    geom_hline(yintercept = 2, linetype = 2) + geom_hline(yintercept = -2, linetype = 2) +
    scale_color_manual(values = c("#000000", "#666666", "#BBBBBB"), name = "frequency") +
    theme(legend.position = c(0.5,0.85),
          legend.title = element_text(face = "bold"),
          text = element_text(size = 15),
          axis.title.y = element_blank(),
          axis.title.x = element_text(face = "bold"))

g2 <- results |>
    ggplot(aes(x = as.factor(horizon), y = t_stat, color = sheet)) +
    geom_boxplot() + theme_minimal() +
    geom_hline(yintercept = 2, linetype = 2) + geom_hline(yintercept = -2, linetype = 2) +
    scale_color_manual(values = c("#000000", "#666666", "#BBBBBB")) +
    theme(legend.position = "none",
          text = element_text(size = 15),
          axis.title.y = element_blank(),
          axis.title.x = element_text(face = "bold")) +
    xlab("return horizon (periods)")

g1 + g2

ggsave("map.pdf", width = 8, height = 5)

7 Hacking intervals & Lipschitz confirmation

The plot below shows the intervals when fixing exactly one mapping, while leaving all other mappings free. This generates many combinations, for each of the 28 possible fixed choices.

module_names <- c("sheet", "prob_remove", "threshold", "prob_levels", "variable", 
                  "horizon", "start_quantile", "stop_quantile", "estimator", "adjusted")

results |>
    mutate(across(all_of(module_names), as.character)) |>
    pivot_longer(cols = all_of(module_names), names_to = "module", values_to = "value") |>
    group_by(module, value) |>
    summarise(min = min(t_stat, na.rm = T),
              max = max(t_stat, na.rm = T)) |>
    mutate(variable = paste(module, value)) |>
    ggplot(aes(y = variable)) + geom_vline(xintercept = 0, linetype = 2) +
    geom_errorbarh(aes(xmin = min, xmax = max, color = module)) +
    theme_bw() + ylab("") + xlab("hacking interval (t-statistics)") +
    theme(legend.title = element_text(face = "bold"),
          axis.title.x = element_text(face = "bold")) +
    guides(color = guide_legend(reverse = TRUE)) +
    scale_color_manual(values = c("#3D6499", "#FCA906", "#FC062F", "#9457CF", "#777777", "#000000",  "#84D0B1", "#568751", "#7B6B4A", "#7DD5FE"))

ggsave("intervals.pdf", width = 8, height = 5)    
size_full <- vector(mode = "list", length = length(module_names) - 1)
for(j in 1:(length(module_names) - 1)){
    #print(j)
    combz <- combn(module_names, j) # All possible combinations
    for(k in 1:ncol(combz)){
        temp <- results |>
            group_by_at(combz[,k]) |>
            summarise(min = min(t_stat, na.rm = T),
                      max = max(t_stat, na.rm = T)) |>
            mutate(size = max - min)
        size_full[[j]] <- c(size_full[[j]], temp |> pull(size))
    }
}

First, some analysis.
Notably, we estimate the coefficients of the model \(y = a \cdot b^n\) (fitted in logs, i.e., \(\log y = \log a + n \log b\)), where \(y\) is the average range of hacking intervals and \(n\) is the number of “free” mappings.
We also examine the ratio between the averages for successive numbers of free mappings.
We see that even after 5-7 free mappings, each new mapping extends the range of \(t\)-statistics by more than 30%.

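As a toy check of this log-linear fit (on synthetic data, not the study's results): if \(y = a \cdot b^n\) holds exactly, regressing \(\log y\) on \(n\) recovers both constants.

```r
# Fit y = a * b^n via lm on the log scale, then back out a and b
n <- 1:9
y <- 0.8 * 1.4^n                 # Exact exponential growth (a = 0.8, b = 1.4)
fit <- lm(log(y) ~ n)
exp(coef(fit))                   # Recovers a (intercept) and b (slope)
```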
data_combi <- reshape2::melt(size_full) |> filter(is.finite(value)) |> mutate(L1 = 10 - L1)
expanding_model <- lm(log(mean) ~ L1, 
                      data = data_combi |> 
                          group_by(L1) |> summarise(mean = mean(value, na.rm = T)))
summary(expanding_model)
## 
## Call:
## lm(formula = log(mean) ~ L1, data = summarise(group_by(data_combi, 
##     L1), mean = mean(value, na.rm = T)))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.36218 -0.07807  0.03954  0.14357  0.17907 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.23890    0.13829  -1.727    0.128    
## L1           0.35178    0.02458  14.314 1.93e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1904 on 7 degrees of freedom
## Multiple R-squared:  0.967,  Adjusted R-squared:  0.9622 
## F-statistic: 204.9 on 1 and 7 DF,  p-value: 1.932e-06
data_combi |> 
    group_by(L1) |> 
    summarise(ARI = mean(value, na.rm = T)) |> 
    mutate(rate = ARI / lag(ARI) - 1) |>
    rename("J-k" = L1) |>
    t() |>
    round(3) |>
    data.frame() |>
    xtable(digits = 3)
## % latex table generated in R 4.2.1 by xtable 1.8-4 package
## % Tue Jan 10 11:28:40 2023
## \begin{table}[ht]
## \centering
## \begin{tabular}{rrrrrrrrrr}
##   \hline
##  & X1 & X2 & X3 & X4 & X5 & X6 & X7 & X8 & X9 \\ 
##   \hline
## J-k & 1.000 & 2.000 & 3.000 & 4.000 & 5.000 & 6.000 & 7.000 & 8.000 & 9.000 \\ 
##   ARI & 0.779 & 1.656 & 2.662 & 3.847 & 5.278 & 7.048 & 9.285 & 12.149 & 15.745 \\ 
##   rate &  & 1.124 & 0.608 & 0.445 & 0.372 & 0.335 & 0.317 & 0.308 & 0.296 \\ 
##    \hline
## \end{tabular}
## \end{table}

Finally, the plot that summarizes the findings.
Note that the figures at the top show the numbers of combinations of mappings over which the boxplots are drawn.

set.seed(42)
data_combi |>
    mutate(size = abs(L1-3.5)) |>
    ggplot(aes(x = as.factor(L1), y = value, color = L1)) + geom_boxplot() + 
    theme_bw() +
    xlab("Degrees of freedom (number of free mappings)") + 
    ylab("range of hacking interval") +
    scale_color_viridis_c(direction = -1) +
    theme(axis.title = element_text(face = "bold"),
          legend.position = "none")  +
    stat_summary(fun = mean, geom = "point", shape = 20, size = 6, fill = "black")+
    geom_function(fun = function(x) exp(expanding_model$coefficients[1]) * exp(expanding_model$coefficients[2])^x) +
    geom_label(data = data_combi |>  group_by(L1) |> summarize(n = n(), max = 20.2), 
               aes(x = as.factor(L1), y = max, label = n)) +
    annotate("rect", xmin = 0, xmax = 2.8, ymin = 5, ymax = 7, fill = "white", alpha = 0.5) + 
    annotate("text", x = 1.5, y = 6, label = "many small intervals") +
    annotate("text", x = 8.5, y = 2, label = "few large intervals")

ggsave("combinations.pdf", width = 8, height = 5)

8 Drivers of scalar outcomes

8.1 Pairwise tests

The chunk below performs paired Wilcoxon tests for all pairs of predictors.
The upper triangle pertains to tests on t-statistics, while the lower triangle relates to tests on p-values.

wilcox_tests <- function(X){ # This function runs the Wilcoxon tests
    N <- ncol(X)
    v <- c()                   # Initialize output with empty object
    for(i in 1:(N-1)){         # Run across the first variable
        for(j in (i+1):N){       # Run across the second variable
            v <- c(v, wilcox.test(X[,i], X[,j], paired = TRUE)$p.value)
        }
    }
    v
}
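As a quick sanity check, the same pairwise logic can be reproduced in base R with `combn()` on a small simulated matrix (toy data, for illustration only):

```r
set.seed(1)
X <- cbind(a = rnorm(30), b = rnorm(30, mean = 1), c = rnorm(30))
# One paired Wilcoxon p-value per column pair: (a,b), (a,c), (b,c)
p <- combn(ncol(X), 2,
           function(ij) wilcox.test(X[, ij[1]], X[, ij[2]], paired = TRUE)$p.value)
p
```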

test_t_stat <- results |>
    select(all_of(c(module_names, "t_stat"))) |>
    pivot_wider(names_from = variable, values_from = t_stat) |>
    select(all_of(unique(results$variable))) |>
    as.matrix() |>
    wilcox_tests()

test_p_val <- results |>
    select(all_of(c(module_names, "p_val"))) |>
    pivot_wider(names_from = variable, values_from = p_val) |>
    select(all_of(unique(results$variable))) |>
    as.matrix() |>
    wilcox_tests() 

variables <- unique(results$variable)
S <- diag(length(variables)) * NA
S[lower.tri(S, diag=F)] <- test_t_stat   # Filled by column
S <- t(S)                                # Hence the transpose
S[lower.tri(S, diag=F)] <- test_p_val
colnames(S) <- unique(results$variable)
rownames(S) <- unique(results$variable)

8.2 Simple mean tests

Below, we test whether the average t-statistic and p-value associated with one predictor differ from those of all other predictors pooled together.
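The one-versus-rest comparison boils down to a two-sample `t.test()`; here is a minimal base-R sketch on simulated groups (hypothetical data, not the notebook's results):

```r
set.seed(3)
g <- rep(c("a", "b", "c"), each = 40)   # three hypothetical groups
x <- rnorm(120) + (g == "a") * 0.8      # group "a" has a shifted mean
# Test group "a" against all other groups pooled together
p_a <- t.test(x[g == "a"], x[g != "a"])$p.value
p_a
```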

t_stat_test <- rep(NA, length(variables)) # Initialize output vectors
p_val_test <- rep(NA, length(variables))
for(j in seq_along(variables)){
    t_stat_test[j] <- t.test(
        results |> filter(variable %in% variables[j]) |> pull(t_stat),   # values of predictor j
        results |> filter(!(variable %in% variables[j])) |> pull(t_stat) # all other values
    )$p.value
    p_val_test[j] <- t.test(
        results |> filter(variable %in% variables[j]) |> pull(p_val),
        results |> filter(!(variable %in% variables[j])) |> pull(p_val)
    )$p.value
}
S <- cbind(S, t_stat_test)
S <- rbind(S, c(p_val_test,NA))
round(S,3) |> xtable(digits = 3)
## % latex table generated in R 4.2.1 by xtable 1.8-4 package
## % Tue Jan 10 11:28:48 2023
## \begin{table}[ht]
## \centering
## \begin{tabular}{rrrrrrrr}
##   \hline
##  & payout & b/m & svar & dfr & dfy & ntis & t\_stat\_test \\ 
##   \hline
## payout &  & 0.000 & 0.000 & 0.000 & 0.000 & 0.000 & 0.011 \\ 
##   b/m & 0.002 &  & 0.000 & 0.000 & 0.000 & 0.000 & 0.000 \\ 
##   svar & 0.293 & 0.000 &  & 0.000 & 0.000 & 0.000 & 0.000 \\ 
##   dfr & 0.000 & 0.000 & 0.000 &  & 0.000 & 0.000 & 0.000 \\ 
##   dfy & 0.577 & 0.000 & 0.491 & 0.000 &  & 0.000 & 0.000 \\ 
##   ntis & 0.000 & 0.000 & 0.000 & 0.000 & 0.000 &  & 0.000 \\ 
##    & 0.036 & 0.000 & 0.000 & 0.000 & 0.012 & 0.000 &  \\ 
##    \hline
## \end{tabular}
## \end{table}

Average p-values for two predictors.

results |> filter(variable == "payout") |> pull(p_val) |> mean(na.rm = T)
## [1] 0.4615937
results |> filter(variable == "svar") |> pull(p_val) |> mean(na.rm = T)
## [1] 0.4726539

9 Focus on b/m and ntis

b/m and ntis are the two most promising variables. Let’s have a closer look at the corresponding \(t\)-statistics. For future use, we estimate their tail behavior beforehand.

t_stat_bm <- results |> 
    na.omit() |>
    filter(variable %in% c("b/m"), prob_levels == 1) |> 
    pull(t_stat)
generate_all_estimates(t_stat_bm[t_stat_bm > 1])
##           Method.of.Estimation Shape.Parameter Scale.Parameter
## 1  Maximum Likelihood Estimate       1.1398302        1.012727
## 2                Least Squares       1.5502124        1.012727
## 3            Method of Moments       1.5321005        1.012727
## 4           Percentiles Method       1.2240918        1.012727
## 5  Modified Percentiles Method       1.3901625        1.012727
## 6 Geometric Percentiles Method       0.9376195        1.012727
## 7       Weighted Least Squares       1.1300860        1.012727
generate_all_estimates(t_stat_bm[t_stat_bm > 1]) |> 
    pull(Shape.Parameter) |>
    mean()
## [1] 1.272015
t_stat_ntis <- results |> 
    na.omit() |>
    filter(variable %in% c("ntis"), prob_levels == 1) |> 
    pull(t_stat)
generate_all_estimates(-t_stat_ntis[t_stat_ntis < (-1)])
##           Method.of.Estimation Shape.Parameter Scale.Parameter
## 1  Maximum Likelihood Estimate        1.249048        1.000833
## 2                Least Squares        1.866523        1.000833
## 3            Method of Moments        1.648273        1.000833
## 4           Percentiles Method        1.597082        1.000833
## 5  Modified Percentiles Method        1.628415        1.000833
## 6 Geometric Percentiles Method        1.165006        1.000833
## 7       Weighted Least Squares        1.239427        1.000833
generate_all_estimates(-t_stat_ntis[t_stat_ntis < (-1)]) |> 
    pull(Shape.Parameter) |>
    mean()
## [1] 1.484825

We discriminate between levels and differences.

g1 <- results |> 
    filter(variable %in% c("b/m")) |>
    mutate(prob_levels = recode_factor(prob_levels, `0` = "Difference", `1` = "Level")) |>
    ggplot(aes(x = t_stat, fill = as.factor(estimator))) + geom_histogram(aes(y = after_stat(density)), position = "dodge") + 
    facet_wrap(vars(prob_levels)) + 
    theme_bw() +
    theme(axis.title.x = element_text(face = "bold"),
          legend.position = "bottom",
          title = element_text(face = "bold"),
          axis.title.y = element_blank()) +
    scale_fill_viridis_d(option = "magma", end = 0.85, name = "estimator") + 
    geom_label(aes(x = 7.5, y = 0.2, label = "b/m"), fill = "white", size = 6) +
    #scale_fill_manual(values = c("#AAAACC", "#444444"), name = "Predictor") +
    #annotate("text", x = 7.5, y = 0.1, label = "strong\n effects") + 
    xlab("t-statistic")
g2 <- results |> 
    filter(variable %in% c("ntis")) |>
    mutate(prob_levels = recode_factor(prob_levels, `0` = "Difference", `1` = "Level")) |>
    ggplot(aes(x = t_stat, fill = as.factor(estimator))) + geom_histogram(aes(y = after_stat(density)), position = "dodge") + 
    facet_wrap(vars(prob_levels)) + 
    theme_bw() +
    theme(axis.title.x = element_text(face = "bold"),
          legend.position = "none",
          title = element_text(face = "bold"),
          axis.title.y = element_blank()) +
    scale_fill_viridis_d(option = "magma", end = 0.85, name = "estimator") + 
    #scale_fill_manual(values = c("#AAAACC", "#444444"), name = "Predictor") +
    geom_label(aes(x = -7.5, y = 0.2, label = "ntis"), fill = "white", size = 6) +
    xlab("t-statistic") 
g1 / g2

ggsave("stable.pdf", width = 8, height = 5)

10 Model averaging

When several estimates have been obtained to quantify one effect, it is natural to seek to aggregate them.

10.1 Frequentist averaging

There are many ways to proceed, but we choose a very common form: \[\hat{b}_* =\sum_{j=1}^Jw_j\hat{b}_j\] with \[w_j= \frac{e^{-\Delta_j/2}}{\sum_{k=1}^Je^{-\Delta_k/2}}, \quad \Delta_j = AIC_j-\underset{j}{\min}AIC_j\]

For the estimation of the variance of the aggregate estimator, we follow Equation (1) in Burnham & Anderson (2004): \[\hat{\sigma}^2_*=\left(\sum_{j=1}^J w_j\sqrt{\hat{\sigma}_j^2 + (\hat{b}_*-\hat{b}_j)^2} \right)^2.\]

Below, the confidence intervals are computed as \([\hat{b}_*-1.96\hat{\sigma}_*/\sqrt{T_*},\hat{b}_*+1.96\hat{\sigma}_*/\sqrt{T_*}]\), where \(T_*=\sum_{j=1}^Jw_jT_j\).
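These formulas can be checked on toy numbers in a few lines of base R (the AIC values, estimates, and sample sizes below are made up for illustration):

```r
aic <- c(100, 101.5, 104)     # hypothetical per-model AICs
b   <- c(0.30, 0.22, 0.10)    # per-model estimates b_j
s   <- c(0.10, 0.12, 0.15)    # per-model standard deviations sigma_j
T_  <- c(200, 200, 180)       # per-model sample sizes T_j

delta  <- aic - min(aic)                         # Delta_j
w      <- exp(-delta / 2) / sum(exp(-delta / 2)) # AIC weights w_j
b_star <- sum(w * b)                             # averaged estimate b_*
s_star <- sum(w * sqrt(s^2 + (b_star - b)^2))    # Burnham-Anderson sigma_*
T_star <- sum(w * T_)                            # weighted sample size T_*
ci <- b_star + c(-1.96, 1.96) * s_star / sqrt(T_star)
ci
```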

m_avg <- results |>
    filter(is.finite(coef)) |>
    group_by(variable, prob_levels, sheet) |>
    mutate(D = (aic - min(aic)),
           weight = if_else(is.na(D), 0, exp(-D/2) /sum(exp(-D/2), na.rm = T)),
           b = sum(weight * coef, na.rm = T),
           n = sum(weight * sample_size, na.rm = T)) |>
    summarise(s = sum(weight * sqrt(sd^2+(b-coef)^2), na.rm = T),
              b = mean(b),
              n = mean(n)) |>
    mutate(t = b / s * sqrt(n))

g_diff <- m_avg |>
    filter(prob_levels == 0) |>
    ggplot(aes(x = reorder(variable, b), y = b, color = sheet)) + 
    geom_point() + geom_hline(yintercept = 0, linetype = 2) +
    geom_errorbar(aes(ymin = b - 1.96 * s / sqrt(n), ymax = b + 1.96 * s / sqrt(n)), width = 0.5) + 
    theme_minimal() + ylab("coefficient (confidence interval)") +
    theme(legend.position = c(0.18,0.8),
          text = element_text(size = 12),
          title = element_text(face = "bold"),
          legend.title = element_text(face = "bold"),
          legend.background = element_rect(fill = "white", color = "white"),
          axis.title.x = element_blank()) +
    labs(color="Frequency",
         subtitle = "Difference predictors")

g_levels <- m_avg |>
    filter(prob_levels == 1) |>
    ggplot(aes(x = reorder(variable, b), y = b, color = sheet)) + 
    geom_hline(yintercept = 0, linetype = 2) +  geom_point() + 
    geom_errorbar(aes(ymin = b - 1.96 * s / sqrt(n), ymax = b + 1.96 * s / sqrt(n)), width = 0.5) + 
    theme_minimal() + ylab("coefficient (confidence interval)") +
    theme(legend.position = c(0.18,0.8),
          text = element_text(size = 12),
          title = element_text(face = "bold"),
          legend.title = element_text(face = "bold"),
          legend.background = element_rect(fill = "white", color = "white"),
          axis.title.x = element_blank()) +
    labs(color="Frequency",
         subtitle = "Level predictors")

g_levels + g_diff

ggsave("model_averaging.pdf", width = 8, height = 4)

10.2 Bayesian averaging

The quantity of interest is \(b\), with posterior probability given the data \(D\) equal to \[P[b | D]= \sum_{m=1}^MP[b|M_m,D]P[M_m|D],\] where \(\mathbb{M}=\{M_m,m=1,\dots,M\}\) is the set of models under consideration. One model corresponds to one complete path. Notably, the above equation translates to the following conditional average and variance: \[\begin{align} \mathbb{E}[b|D]&=\sum_{m=1}^M\hat{b}_mP[M_m|D] \label{eq:mean} \\ \mathbb{V}[b|D]&=\sum_{m=1}^M \left(\mathbb{V}(b|M_m,D) + \hat{b}^2_m \right)P[M_m|D] - (\mathbb{E}[b|D])^2 \end{align}\] where \(\hat{b}_m\) is the estimated effect in model \(m\). The posterior model probabilities are given by \[\begin{equation} P[M_m|D] = \left(\sum_{j=1}^M \frac{P[M_j]}{P[M_m]}\frac{l_D(M_j)}{l_D(M_m)} \right)^{-1}, \end{equation}\] with \(l_D(M_j)\) being the marginal likelihood of model \(j\). Because we are agnostic with respect to which model is more realistic, we will set the prior odds \(\frac{P[M_j]}{P[M_m]}\) equal to one. In this case, the posterior probabilities are simply proportional to the likelihoods. The remaining Bayes factor is by far the most complex: \[\begin{equation} \frac{l_D(M_j)}{l_D(M_m)}=\left(\frac{n_j}{n_j+1}\right)^{\frac{k_j}{2}}\left(\frac{n_m+1}{n_m}\right)^{\frac{k_m}{2}}\frac{\left(\frac{s_m+n_mv_m}{n_m+1} \right)^{(n_m-1)/2}}{\left(\frac{s_j+n_jv_j}{n_j+1} \right)^{(n_j-1)/2}}, \label{eq:odds} \end{equation}\] where \(n_j\) is the number of observations used in model \(j\) and \(k_j\) is the number of predictors in this model, omitting the constant (\(k_j=1\) in our case). Moreover, \(n_jv_j\) is the sample variance of the dependent variable in model \(j\). Finally, \(s_j\) is the sum of squared residuals under model \(j\).
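Under equal prior odds, the posterior model probabilities reduce to normalized marginal likelihoods. The following schematic base-R sketch uses made-up log-likelihoods, estimates, and variances purely to illustrate the aggregation:

```r
loglik <- c(-120, -121, -125)     # hypothetical marginal log-likelihoods
p_m <- exp(loglik - max(loglik))  # subtract the max for numerical stability
p_m <- p_m / sum(p_m)             # posterior probabilities P[M_m | D]

b_m <- c(0.25, 0.18, 0.05)        # per-model estimates b_m
v_m <- c(0.010, 0.012, 0.020)     # per-model conditional variances V(b | M_m, D)
b_bma <- sum(b_m * p_m)                      # E[b | D]
v_bma <- sum((v_m + b_m^2) * p_m) - b_bma^2  # V[b | D]
c(b_bma, v_bma)
```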

level.labs <- c("Difference predictor", "Level predictor")
names(level.labs) <- c("0", "1")

results |> 
    na.omit() |> # Likelihood below
    mutate(lik = sqrt(sample_size/(sample_size+1))/((sum_sq_err+var_y)/(sample_size+1))^((sample_size-1)/2)) |>
    filter(sheet == "Yearly") |>
    group_by(variable, prob_levels) |>
    summarize(p = lik / sum(lik),
              b = sum(coef * p),
              n = sum(sample_size * p),
              sdev = sqrt(sum((sd^2+coef^2)*p)-b^2)) |>
    ggplot(aes(x = reorder(variable, b))) + 
    geom_hline(yintercept = 0, linetype = 2) + 
    theme_minimal() + facet_wrap(vars(prob_levels), labeller = labeller(prob_levels = level.labs)) +
    geom_errorbar(aes(ymin = b - 3* sdev/sqrt(n), ymax = b + 3 * sdev/sqrt(n) , width = 0.3), color = "#AAAAAA" ) +
    geom_errorbar(aes(ymin = b - 2* sdev/sqrt(n), ymax = b + 2* sdev/sqrt(n) , width = 0.5), color = "#555555") +
    geom_errorbar(aes(ymin = b , ymax = b, width = 0.8), color = "black" ) +
    theme(axis.title.x = element_blank(),
          text = element_text(size = 14),
          strip.text = element_text(size = 14, face = "bold")) + ylab("coefficient") +
    geom_point(data = m_avg |> filter(sheet == "Yearly"), aes(y = b), color = "red") 

#ggsave("Bayes_avg.pdf", width = 8, height = 4)

11 Rejection rates

Below, we plot the estimated rejection rates across all tests, both for each predictor, and for the full sample (ALL, in black).
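A rejection rate is simply the share of p-values falling below a given threshold; under the null, p-values are uniform, so the rate should match the threshold itself (this is the role of the "Uniform" benchmark row). A minimal sketch on simulated p-values:

```r
set.seed(7)
p <- runif(1000)                        # p-values under the null are uniform
thresholds <- c(0.005, 0.01, 0.05, 0.1)
rr <- sapply(thresholds, function(a) mean(p < a))
rr                                      # should be close to the thresholds
```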

agg <- results |>
    summarise(RR_005 = mean(p_val < 0.005, na.rm = T),
              RR_010 = mean(p_val < 0.01, na.rm = T),
              RR_050 = mean(p_val < 0.05, na.rm = T),
              RR_100 = mean(p_val < 0.1, na.rm = T)) |>
    bind_rows(data.frame(RR_005 = 0.005, RR_010 = 0.01, RR_050 = 0.05, RR_100 = 0.1)) |>
    bind_cols(variable = c("ALL", "Uniform")) |>
    pivot_longer(-variable, names_to = "Threshold", values_to = "RR")
p1 <- results |>
    group_by(variable) |>
    summarise(RR_005 = mean(p_val < 0.005, na.rm = T),
              RR_010 = mean(p_val < 0.01, na.rm = T),
              RR_050 = mean(p_val < 0.05, na.rm = T),
              RR_100 = mean(p_val < 0.1, na.rm = T)) |>
    pivot_longer(-variable, names_to = "Threshold", values_to = "RR") |>
    bind_rows(agg) |>
    ggplot(aes(x = Threshold, y = RR, color = variable, group = variable)) + 
    geom_line() + geom_point(size = 2.5) + theme_minimal() +
    scale_color_manual(values = c("#000000", "#888888", "#CCCCCC", "#581845", "#C70099", "#FFC300", "#4C9129", "#1199FF")) +
    scale_x_discrete(labels = c(
        "RR_005" = "0.5%",
        "RR_010" = "1%",
        "RR_050" = "5%",
        "RR_100" = "10%"
    )) +
    theme(legend.position = c(0.3,0.85),
          text = element_text(size = 16),
          legend.title = element_text(face = "bold"),
          legend.background = element_rect(fill = "white", color = "white"),
          axis.title.y = element_blank()) +
    guides(color = guide_legend(ncol = 2))

p2 <- results |>
    filter(prob_levels == 1) |>
    group_by(variable) |>
    summarise(RR_005 = mean(p_val < 0.005, na.rm = T),
              RR_010 = mean(p_val < 0.01, na.rm = T),
              RR_050 = mean(p_val < 0.05, na.rm = T),
              RR_100 = mean(p_val < 0.1, na.rm = T)) |>
    pivot_longer(-variable, names_to = "Threshold", values_to = "RR") |>
    ggplot(aes(x = Threshold, y = RR, color = variable, group = variable)) + 
    geom_line() + geom_point(size = 2.5) + theme_minimal() +
    scale_color_manual(values = c("#888888", "#CCCCCC", "#581845", "#C70099", "#FFC300", "#4C9129", "#1199FF")) +
    scale_x_discrete(labels = c(
        "RR_005" = "0.5%",
        "RR_010" = "1%",
        "RR_050" = "5%",
        "RR_100" = "10%"
    )) +
    theme(legend.position = c(0.25,0.85),
          text = element_text(size = 16),
          legend.title = element_text(face = "bold"),
          legend.background = element_rect(fill = "white", color = "white"),
          axis.title.y = element_blank()) +
    guides(color = guide_legend(ncol = 2))
p1 + p2

#ggsave("RR.pdf", width = 9, height = 4.5)

12 Additional material

On ntis & b/m, from a previous version of the paper.

results |>
    filter(variable %in% c("b/m", "ntis")) |>
    ggplot(aes(x = t_stat, fill = as.factor(prob_levels))) + geom_vline(xintercept = 0, linetype = 2) +
    geom_histogram(alpha = 0.7, position = "dodge") +
    facet_wrap(vars(variable)) + theme_minimal() +
    xlab("t-statistic") + 
    theme(legend.position = c(0.5,0.8),
          legend.title = element_text(face = "bold"),
          legend.background = element_rect(fill = "white", color = "white")) +
    scale_fill_manual(name = "predictor", labels = c("Differences", "Levels"), values = c("#4DA2F1", "#F1A44D"))

Finally, we illustrate the differences with a jitter plot.

set.seed(42) # Random seed - for reproducibility
results |>
    filter(variable %in% c("b/m", "ntis")) |>
    mutate(horizon = as.factor(horizon),
           horizon = recode_factor(horizon,
                                   `1` = "1 period",
                                   `3` = "3 periods",
                                   `6` = "6 periods",
                                   `12` = "12 periods")) |>
    ggplot(aes(x = sheet, y = t_stat,  color = as.factor(prob_levels))) + 
    geom_hline(yintercept = 0, linetype = 3, color = "grey") +
    geom_hline(yintercept = -2, linetype = 2) +
    geom_hline(yintercept = 2, linetype = 2) +
    scale_x_discrete() +
    geom_jitter(alpha = 0.7) +
    facet_grid(vars(variable), vars(horizon), scales = "free") + theme_minimal() +
    ylab("t-statistic") + 
    theme(legend.title = element_text(face = "bold"),
          axis.title.x = element_blank(),
          strip.text = element_text(face = "bold"),
          legend.position = c(0.15,0.12),
          text = element_text(size = 13),
          legend.background = element_rect(fill = "white", color = "white")) +
    scale_color_manual(name = "predictor", labels = c("Differences", "Levels"), values = c("#4DA2F1", "#F1A44D")) #+

#geom_label(data = table, aes(label = t), show_guide = FALSE)
#ggsave("focus.pdf", width = 9, height = 4)

A look at the tails of the t-stat distribution

g1 <- results |> 
    filter(variable %in% c("b/m")) |>
    mutate(prob_levels = recode_factor(prob_levels, `0` = "Difference", `1` = "Level")) |>
    ggplot(aes(x = t_stat, fill = as.factor(prob_levels))) + geom_histogram(aes(y = after_stat(density)), position = "dodge") + 
    theme_bw() +
    theme(axis.title.x = element_text(face = "bold"),
          legend.position = "none",
          title = element_text(face = "bold"),
          axis.title.y = element_blank()) +
    scale_fill_manual(values = c("#AAAACC", "#444444"), name = "Predictor") +
    annotate("text", x = 7.5, y = 0.1, label = "strong\n effects") + 
    xlab("t-statistic") + ggtitle("b/m") +
    geom_function(fun = dstable, colour = "red", 
                  args = list(alpha = 1.24, beta = 1, gamma = 1, 
                              delta = results |> filter(variable %in% c("b/m"), prob_levels == 1) |> pull(t_stat) |> mean(na.rm=T)))
g2 <- results |> 
    filter(variable %in% c("ntis")) |>
    mutate(prob_levels = recode_factor(prob_levels, `0` = "Difference", `1` = "Level")) |>
    ggplot(aes(x = t_stat, fill = as.factor(prob_levels))) + geom_histogram(aes(y = after_stat(density)), position = "dodge") + 
    theme_bw() +
    theme(legend.position = c(0.3,0.8),
          axis.title.x = element_text(face = "bold"),
          title = element_text(face = "bold"),
          axis.title.y = element_blank()) +
    scale_fill_manual(values = c("#AAAACC", "#444444"), name = "Predictor") +
    annotate("text", x = -7.5, y = 0.1, label = "strong\n effects") + 
    xlab("t-statistic") + ggtitle("ntis") +
    geom_function(fun = dstable, colour = "red", 
                  args = list(alpha = 1.45, beta = -1, gamma = 1, 
                              delta = results |> filter(variable %in% c("ntis"), prob_levels == 1) |> pull(t_stat) |> mean(na.rm=T)))
g1 + g2

Below, we estimate the tail exponent of errors in simple monthly predictive regressions.
Left and right tails are treated separately.

x <- lag(as.numeric(data_month$`b/m`))
y <- data_month$Index/lag(data_month$Index)-1
data_bm <- data.frame(x = x, y = y) |> na.omit()
res <- lm(data_bm$y ~ data_bm$x)$residuals                             # Residuals from model with b/m predictor
table <- generate_all_estimates(-res[res < 0])[,1:2]
colnames(table) <- c("Method", "b/m, left")
table <- table |>
    bind_cols(`b/m, right` = generate_all_estimates(res[res > 0])$Shape.Parameter)

x <- lag(as.numeric(data_month$`ntis`))
y <- data_month$Index/lag(data_month$Index)-1
res <- lm(y ~ x)$residuals                             # Residuals from model with ntis predictor

table <- table |>
    bind_cols(`ntis, left` = generate_all_estimates(-res[res < 0])$Shape.Parameter) |>
    bind_cols(`ntis, right` = generate_all_estimates(res[res > 0])$Shape.Parameter)

#xtable(table, digits = 3)
table
##                         Method b/m, left b/m, right ntis, left ntis, right
## 1  Maximum Likelihood Estimate 0.2428556  0.2187568  0.1161515   0.1456144
## 2                Least Squares 0.7035374  0.7174895  0.5567472   0.6551829
## 3            Method of Moments 1.0100470  1.0066174  1.0001024   1.0006469
## 4           Percentiles Method 0.7717784  0.9062372  0.7549434   0.8492861
## 5  Modified Percentiles Method 1.0634475  1.4983419  1.0923893   1.3901179
## 6 Geometric Percentiles Method 0.4934899  0.5580351  0.4567327   0.5277124
## 7       Weighted Least Squares 0.2410709  0.2173651  0.1152814   0.1446107
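As an independent cross-check of such shape estimates, a plain Hill estimator can be coded in a few lines of base R; here it is applied to simulated Pareto data with a known exponent (illustration only, not the regression residuals above):

```r
set.seed(123)
alpha_true <- 1.5
x <- (1 - runif(5000))^(-1 / alpha_true)   # standard Pareto draws, tail index 1.5

hill <- function(x, k) {                   # Hill estimator on the k largest values
    xs <- sort(x, decreasing = TRUE)
    1 / mean(log(xs[1:k] / xs[k + 1]))
}
hill(x, k = 500)                           # should be close to alpha_true
```

The choice of `k` (the number of upper order statistics) drives the usual bias-variance trade-off; in practice one would inspect the estimate across a range of `k` values.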

Another look at the results

results |> 
    ggplot(aes(x = t_stat, fill = as.factor(prob_levels))) +
    geom_histogram(position = "dodge") + theme_bw() + 
    facet_wrap(vars(variable), scales = "free") +
    xlab("t-statistic") + 
    theme(legend.position = c(0.44,0.85),
          legend.title = element_text(face = "bold"),
          axis.title.y = element_blank()) +
    scale_fill_manual(values = c("#1155DD", "#FFBB11"), name = "regressor", labels = c("differences", "levels"))

ggsave("hist_vars.pdf", width = 8, height = 5)